{
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Feature Transformation"
      ],
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "Feature Transformation (Data Preprocessing) is transformation of data to improve the accuracy of the algorithm.\n",
        "\n",
        "Normalization and changing distribution(Scaling), Interactions and Filling in the missing values.\n",
        "\nTransforms data such as Rescale Data, Standardize Data, Normalize Data, and Binarize Data (Make Binary)"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "import matplotlib.pyplot as plt\n",
        "import pandas as pd\n",
        "\n",
        "import warnings\n",
        "warnings.filterwarnings(\"ignore\")\n",
        "\n",
        "# fix_yahoo_finance is used to fetch data \n",
        "import fix_yahoo_finance as yf\n",
        "yf.pdr_override()"
      ],
      "outputs": [],
      "execution_count": 1,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# input\n",
        "symbol = 'AMD'\n",
        "start = '2014-01-01'\n",
        "end = '2018-08-27'\n",
        "\n",
        "# Read data \n",
        "dataset = yf.download(symbol,start,end)"
      ],
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[*********************100%***********************]  1 of 1 downloaded\n"
          ]
        }
      ],
      "execution_count": 2,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Add more data\n",
        "dataset['Increase_Decrease'] = np.where(dataset['Volume'].shift(-1) > dataset['Volume'],'Increase','Decrease')\n",
        "dataset['Buy_Sell_on_Open'] = np.where(dataset['Open'].shift(-1) > dataset['Open'],1,0)\n",
        "dataset['Buy_Sell'] = np.where(dataset['Adj Close'].shift(-1) > dataset['Adj Close'],1,0)\n",
        "dataset['Returns'] = dataset['Adj Close'].pct_change()\n",
        "dataset['Average'] = dataset[['Open','High','Low','Adj Close']].mean(axis=1)\n",
        "dataset['Std'] = dataset[['Open','High','Low','Adj Close']].std(axis=1)\n",
        "dataset = dataset.dropna()\n",
        "dataset.head()"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 3,
          "data": {
            "text/plain": [
              "            Open  High   Low  Close  Adj Close    Volume Increase_Decrease  \\\n",
              "Date                                                                         \n",
              "2014-01-03  3.98  4.00  3.88   4.00       4.00  22887200          Increase   \n",
              "2014-01-06  4.01  4.18  3.99   4.13       4.13  42398300          Increase   \n",
              "2014-01-07  4.19  4.25  4.11   4.18       4.18  42932100          Decrease   \n",
              "2014-01-08  4.23  4.26  4.14   4.18       4.18  30678700          Decrease   \n",
              "2014-01-09  4.20  4.23  4.05   4.09       4.09  30667600          Decrease   \n",
              "\n",
              "            Buy_Sell_on_Open  Buy_Sell   Returns  Average       Std  \n",
              "Date                                                                 \n",
              "2014-01-03                 1         1  0.012658   3.9650  0.057446  \n",
              "2014-01-06                 1         1  0.032500   4.0775  0.092150  \n",
              "2014-01-07                 1         0  0.012107   4.1825  0.057373  \n",
              "2014-01-08                 0         0  0.000000   4.2025  0.053151  \n",
              "2014-01-09                 0         1 -0.021531   4.1425  0.086168  "
            ],
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Open</th>\n",
              "      <th>High</th>\n",
              "      <th>Low</th>\n",
              "      <th>Close</th>\n",
              "      <th>Adj Close</th>\n",
              "      <th>Volume</th>\n",
              "      <th>Increase_Decrease</th>\n",
              "      <th>Buy_Sell_on_Open</th>\n",
              "      <th>Buy_Sell</th>\n",
              "      <th>Returns</th>\n",
              "      <th>Average</th>\n",
              "      <th>Std</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>Date</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>2014-01-03</th>\n",
              "      <td>3.98</td>\n",
              "      <td>4.00</td>\n",
              "      <td>3.88</td>\n",
              "      <td>4.00</td>\n",
              "      <td>4.00</td>\n",
              "      <td>22887200</td>\n",
              "      <td>Increase</td>\n",
              "      <td>1</td>\n",
              "      <td>1</td>\n",
              "      <td>0.012658</td>\n",
              "      <td>3.9650</td>\n",
              "      <td>0.057446</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2014-01-06</th>\n",
              "      <td>4.01</td>\n",
              "      <td>4.18</td>\n",
              "      <td>3.99</td>\n",
              "      <td>4.13</td>\n",
              "      <td>4.13</td>\n",
              "      <td>42398300</td>\n",
              "      <td>Increase</td>\n",
              "      <td>1</td>\n",
              "      <td>1</td>\n",
              "      <td>0.032500</td>\n",
              "      <td>4.0775</td>\n",
              "      <td>0.092150</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2014-01-07</th>\n",
              "      <td>4.19</td>\n",
              "      <td>4.25</td>\n",
              "      <td>4.11</td>\n",
              "      <td>4.18</td>\n",
              "      <td>4.18</td>\n",
              "      <td>42932100</td>\n",
              "      <td>Decrease</td>\n",
              "      <td>1</td>\n",
              "      <td>0</td>\n",
              "      <td>0.012107</td>\n",
              "      <td>4.1825</td>\n",
              "      <td>0.057373</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2014-01-08</th>\n",
              "      <td>4.23</td>\n",
              "      <td>4.26</td>\n",
              "      <td>4.14</td>\n",
              "      <td>4.18</td>\n",
              "      <td>4.18</td>\n",
              "      <td>30678700</td>\n",
              "      <td>Decrease</td>\n",
              "      <td>0</td>\n",
              "      <td>0</td>\n",
              "      <td>0.000000</td>\n",
              "      <td>4.2025</td>\n",
              "      <td>0.053151</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2014-01-09</th>\n",
              "      <td>4.20</td>\n",
              "      <td>4.23</td>\n",
              "      <td>4.05</td>\n",
              "      <td>4.09</td>\n",
              "      <td>4.09</td>\n",
              "      <td>30667600</td>\n",
              "      <td>Decrease</td>\n",
              "      <td>0</td>\n",
              "      <td>1</td>\n",
              "      <td>-0.021531</td>\n",
              "      <td>4.1425</td>\n",
              "      <td>0.086168</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 3,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "dataset.shape"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 15,
          "data": {
            "text/plain": [
              "(1171, 12)"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 15,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "X = np.array(dataset['Open']).reshape(1171,-1)\n",
        "Y = np.array(dataset['Adj Close']).reshape(1171,-1)"
      ],
      "outputs": [],
      "execution_count": 23,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from sklearn.model_selection import train_test_split\n",
        "from sklearn import preprocessing\n",
        "\n",
        "X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)\n",
        "quantile_transformer = preprocessing.QuantileTransformer(random_state=0)\n",
        "X_train_trans = quantile_transformer.fit_transform(X_train)\n",
        "X_test_trans = quantile_transformer.transform(X_test)"
      ],
      "outputs": [],
      "execution_count": 24,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 25,
          "data": {
            "text/plain": [
              "array([ 1.62    ,  2.7     ,  4.2     , 11.4275  , 24.940001])"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 25,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 26,
          "data": {
            "text/plain": [
              "array([9.99999998e-08, 2.50250250e-01, 4.99499499e-01, 7.50250250e-01,\n",
              "       9.99999900e-01])"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 26,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "np.percentile(X_train[:, 0], [0, 25, 50, 75, 100]) "
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 27,
          "data": {
            "text/plain": [
              "array([ 1.62    ,  2.7     ,  4.2     , 11.4275  , 24.940001])"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 27,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "np.percentile(X_test_trans[:, 0], [0, 25, 50, 75, 100])"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 28,
          "data": {
            "text/plain": [
              "array([0.00342075, 0.25725726, 0.55948203, 0.77948967, 0.99173318])"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 28,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Rescale Data        \n",
        "\nRescaling is the most simplest and one can do is to take a range of data and map it onto a zero-to-one scale."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "from sklearn.preprocessing import MinMaxScaler\n",
        "\n",
        "scaler = MinMaxScaler(feature_range=(0, 1))\n",
        "rescaledX = scaler.fit_transform(X)\n",
        "# summarize transformed data\n",
        "np.set_printoptions(precision=3)\n",
        "print(rescaledX[0:5,:])"
      ],
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[[0.101]\n",
            " [0.102]\n",
            " [0.11 ]\n",
            " [0.112]\n",
            " [0.111]]\n"
          ]
        }
      ],
      "execution_count": 33,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Standardize Data        \n",
        "\n",
        "Standardization is a method that transfrom attributes with a Gaussian distribution and differing means and standard deviations to standard Gaussian distribution with a mean of 0 and a standard deviation of 1. \n",
        "\nThis method works best with rescaled data with linear regression, logistic regression and linear discriminate analysis."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "from sklearn.preprocessing import StandardScaler\n",
        "\n",
        "scaler = StandardScaler().fit(X)\n",
        "rescaledX = scaler.transform(X)\n",
        "# summarize transformed data\n",
        "np.set_printoptions(precision=3)\n",
        "print(rescaledX[0:5,:])"
      ],
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[[-0.622]\n",
            " [-0.616]\n",
            " [-0.579]\n",
            " [-0.571]\n",
            " [-0.577]]\n"
          ]
        }
      ],
      "execution_count": 34,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Normalize Data\n",
        "\nNormalizing rescale each of observation (row) to have a length of 1 (called a unit norm in linear algebra). This method is very uselufl if the dataset has many zeros. "
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "from sklearn.preprocessing import Normalizer\n",
        "\n",
        "scaler = Normalizer().fit(X)\n",
        "normalizedX = scaler.transform(X)\n",
        "# summarize transformed data\n",
        "np.set_printoptions(precision=3)\n",
        "print(normalizedX[0:5,:])"
      ],
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[[1.]\n",
            " [1.]\n",
            " [1.]\n",
            " [1.]\n",
            " [1.]]\n"
          ]
        }
      ],
      "execution_count": 35,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Binarize Data (Make Binary)\n",
        "\n\nBinarize is use for transform dataing to threshold when the values above the threshold are marked 1 and all equal to or below are marked as 0.  This method is useful if you have probabilities that want to make crisp vcalues. "
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "from sklearn.preprocessing import Binarizer\n",
        "\n",
        "binarizer = Binarizer(threshold=0.0).fit(X)\n",
        "binaryX = binarizer.transform(X)\n",
        "# summarize transformed data\n",
        "np.set_printoptions(precision=3)\n",
        "print(binaryX[0:5,:])"
      ],
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[[1.]\n",
            " [1.]\n",
            " [1.]\n",
            " [1.]\n",
            " [1.]]\n"
          ]
        }
      ],
      "execution_count": 36,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    }
  ],
  "metadata": {
    "kernel_info": {
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.5.5",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python"
    },
    "kernelspec": {
      "name": "python3",
      "language": "python",
      "display_name": "Python 3"
    },
    "nteract": {
      "version": "0.11.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}